This is the final project for METCS544 - Bank Churners.
Goal of the project: Analyze the customer attrition data and gather insights about what could be the contributing factors.
Link to dataset : Bank-Churners-Dataset
The dataset contains 8500 records of existing customers and 1627 records of attrited customers.
We will be analyzing the following :-
library(modeest)
options(digits=3)
library(sampling)
library(plotly)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'httr':
## method from
## print.response rmutil
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
setwd("C:/ARCHANA/Boston University MS Applied Data Analytics/METCS544 - Foundations of Analytics with R - Spring1/Final Project/")
raw.data <- read.csv("BankChurners.csv")
Retaining only the columns that are being used for analysis
raw.data <- raw.data[,names(raw.data) %in% c("Attrition_Flag","Customer_Age",
"Gender","Income_Category",
"Card_Category","Months_on_book")]
Looking for missing values in the data
unique(raw.data$Customer_Age)
## [1] 45 49 51 40 44 32 37 48 42 65 56 35 57 41 61 47 62 54 59 63 53 58 55 66 50
## [26] 38 46 52 39 43 64 68 67 60 73 70 36 34 33 26 31 29 30 28 27
unique(raw.data$Gender)
## [1] "M" "F"
unique(raw.data$Income_Category) # has 1112 unknown values
## [1] "$60K - $80K" "Less than $40K" "$80K - $120K" "$40K - $60K"
## [5] "$120K +" "Unknown"
nrow(raw.data[raw.data$Income_Category=="Unknown",])
## [1] 1112
unique(raw.data$Card_Category)
## [1] "Blue" "Gold" "Silver" "Platinum"
unique(raw.data$Months_on_book)
## [1] 39 44 36 34 21 46 27 31 54 30 48 37 56 42 49 33 28 38 41 43 45 52 40 50 35
## [26] 47 32 20 29 25 53 24 55 23 22 26 13 51 19 15 17 18 16 14
Income-Category field has unknown values, removing these rows from the raw data
raw.data <- subset(raw.data,Income_Category!="Unknown")
Split the raw data into 2 parts - Attrited and Existing Customers
existing.cust <- raw.data[raw.data$Attrition_Flag=="Existing Customer",names(raw.data) %in%
c("Attrition_Flag","Customer_Age","Gender","Income_Category",
"Card_Category","Months_on_book")]
attrited.cust <- raw.data[raw.data$Attrition_Flag=="Attrited Customer",names(raw.data) %in%
c("Attrition_Flag","Customer_Age","Gender","Income_Category",
"Card_Category","Months_on_book")]
Card Category refers to the different type of credit cards that are offered to customers - Blue, Silver, Gold & Platinum.
Here is the table representation to view the frequencies of card-category
All Customers -
all.card <- table(raw.data$Card_Category)
all.card
##
## Blue Gold Platinum Silver
## 8391 107 15 502
Attrited customers -
attr.card <- table(attrited.cust$Card_Category)
attr.card
##
## Blue Gold Platinum Silver
## 1343 19 3 75
sprintf("Percentage of customers who attrited - by Card Category - %s - %.2f%%",
rownames(attr.card),round((attr.card/all.card)*100,2))
## [1] "Percentage of customers who attrited - by Card Category - Blue - 16.01%"
## [2] "Percentage of customers who attrited - by Card Category - Gold - 17.76%"
## [3] "Percentage of customers who attrited - by Card Category - Platinum - 20.00%"
## [4] "Percentage of customers who attrited - by Card Category - Silver - 14.94%"
card.types <- c("Blue","Gold","Platinum","Silver")
plot_ly() %>%
add_trace(x = ~card.types, y = ~all.card, type = 'bar',
text = all.card, textposition = 'auto',name="All Customers",
marker = list(color = 'blue',
line = list(color = 'blue', width = 1.5))) %>%
add_trace(x = ~card.types, y = ~attr.card, type = 'bar',
text = attr.card, textposition = 'auto',name="Attrited Customers",
marker = list(color = 'red',
line = list(color = 'red', width = 1.5))) %>%
layout(title = "Distribution of Card Categories",
barmode = 'group',
xaxis = list(title = "Card Category"),
yaxis = list(title = "No. of Customers"))
INFERENCES -
We see that majority of the attrited(1343) customers held a blue category credit card.
Among the 4 card categories, Platinum has highest attrition rate - 20%
Analysis of distribution of card categories using barplot
Months-on-book refers to the number of months the customer has held the credit-card.
Here is a boxplot of months-on-book to analyze the spread
f <- fivenum(attrited.cust$Months_on_book)
out <- c(f[2]-1.5*(f[4]-f[2]),f[4]+1.5*(f[4]-f[2]))
plot_ly(attrited.cust, y = ~Months_on_book, type="box",
name = 'Months-on-book',quartilemethod="exclusive") %>%
layout(title="Months-on-book - Attrited Customers")
sprintf("The average months-on-book for attrited customers is %g",
round(mean(attrited.cust$Months_on_book)))
## [1] "The average months-on-book for attrited customers is 36"
sprintf("Most of the attrited customers were with the company for %s months",
mfv(attrited.cust$Months_on_book))
## [1] "Most of the attrited customers were with the company for 36 months"
INFERENCES -
The months-on-book for attrited customers ranges between 13 and 56 with about 11 outliers.
The interquartile range/middle 50% of the data is between 32 and 40 and the median of 36 separates top 50% of the data from bottom 50%.
The average is about the same as the median of 36.
Histogram to analyze frequencies of the months that customers stayed with the company -
hist(attrited.cust$Months_on_book,main="Months-on-book - Attrited
Customers",col="light blue",xlab="months-on-book")
INFERENCE -
Income Category refers to the income range of the customer -
Less than $40K, $40K-$60K, $60K-$80K, $80K-$120K, $120K+
Contingency table of income-category & card-category - Attrited customers
sorted.attr.data <- attrited.cust[order(attrited.cust$Income_Category),]
sorted.attr.data$Income_Category <- factor(sorted.attr.data$Income_Category,
levels=c("Less than $40K","$40K - $60K",
"$60K - $80K","$80K - $120K",
"$120K +"))
income.card <- table(sorted.attr.data$Income_Category,
sorted.attr.data$Card_Category)
income.card
##
## Blue Gold Platinum Silver
## Less than $40K 586 4 2 20
## $40K - $60K 257 2 1 11
## $60K - $80K 172 6 0 11
## $80K - $120K 215 5 0 22
## $120K + 113 2 0 11
INFERENCES -
We see that for customers with income of 120k+, none of them were holding a platinum card and very few held Gold cards. Platinum/Gold cards with better rewards/perks could have given these customers more interest to stay.
For all income-levels, we see a big disparity between number of customers holding blue vs silver - we could look into opportunities for upgrading/reduce attrition.
Marginal & conditional distribution of income-category & card-category
addmargins(income.card)
##
## Blue Gold Platinum Silver Sum
## Less than $40K 586 4 2 20 612
## $40K - $60K 257 2 1 11 271
## $60K - $80K 172 6 0 11 189
## $80K - $120K 215 5 0 22 242
## $120K + 113 2 0 11 126
## Sum 1343 19 3 75 1440
# income-category
round(prop.table(income.card,1),3)
##
## Blue Gold Platinum Silver
## Less than $40K 0.958 0.007 0.003 0.033
## $40K - $60K 0.948 0.007 0.004 0.041
## $60K - $80K 0.910 0.032 0.000 0.058
## $80K - $120K 0.888 0.021 0.000 0.091
## $120K + 0.897 0.016 0.000 0.087
# card-category
round(prop.table(income.card,2),3)
##
## Blue Gold Platinum Silver
## Less than $40K 0.436 0.211 0.667 0.267
## $40K - $60K 0.191 0.105 0.333 0.147
## $60K - $80K 0.128 0.316 0.000 0.147
## $80K - $120K 0.160 0.263 0.000 0.293
## $120K + 0.084 0.105 0.000 0.147
INFERENCE -
Mosaic plot - representation of contingency table
mosaicplot(income.card,color=c("pink","purple"),cex.axis=0.6,las=1,
xlab="Income Category",ylab="Card Category",
main="IncomeCategory vs CardCategory")
Analysis - Customer Income-Category/Card-Category/Gender
Breakdown for attrited customers -
aincome.card.gender <- table(sorted.attr.data$Income_Category,
sorted.attr.data$Card_Category,
sorted.attr.data$Gender)
ftable(aincome.card.gender)
## F M
##
## Less than $40K Blue 559 27
## Gold 4 0
## Platinum 2 0
## Silver 17 3
## $40K - $60K Blue 160 97
## Gold 2 0
## Platinum 0 1
## Silver 4 7
## $60K - $80K Blue 0 172
## Gold 0 6
## Platinum 0 0
## Silver 0 11
## $80K - $120K Blue 0 215
## Gold 0 5
## Platinum 0 0
## Silver 0 22
## $120K + Blue 0 113
## Gold 0 2
## Platinum 0 0
## Silver 0 11
Breakdown for existing customers -
sorted.exis.data <- existing.cust[order(existing.cust$Income_Category),]
sorted.exis.data$Income_Category <- factor(sorted.exis.data$Income_Category,
levels=c("Less than $40K","$40K - $60K",
"$60K - $80K","$80K - $120K",
"$120K +"))
eincome.card.gender <- table(sorted.exis.data$Income_Category,
sorted.exis.data$Card_Category,
sorted.exis.data$Gender)
ftable(eincome.card.gender)
## F M
##
## Less than $40K Blue 2587 230
## Gold 18 2
## Platinum 2 0
## Silver 95 15
## $40K - $60K Blue 800 618
## Gold 5 8
## Platinum 0 0
## Silver 43 45
## $60K - $80K Blue 0 1101
## Gold 0 23
## Platinum 0 4
## Silver 0 85
## $80K - $120K Blue 0 1180
## Gold 0 16
## Platinum 0 2
## Silver 0 95
## $120K + Blue 0 532
## Gold 0 16
## Platinum 0 4
## Silver 0 49
Bar Plots for existing vs attrited - by Gender -
male <- raw.data[raw.data$Gender=="M",]
female <- raw.data[raw.data$Gender=="F",]
mpercent <- paste0(round(table(male$Attrition_Flag)/nrow(male),2)*100,"%")
fpercent <- paste0(round(table(female$Attrition_Flag)/nrow(female),2)*100,"%")
subplot(
plot_ly(x=c("Attrited","Existing"),y=table(male$Attrition_Flag),type="bar",
name="Male") %>%
add_text(text=mpercent,textposition = "top",showlegend=FALSE),
plot_ly(x=c("Attrited","Existing"),y=table(female$Attrition_Flag),type="bar",
name="Female") %>%
add_text(text=fpercent,textposition = "top",showlegend=FALSE) %>%
layout(title="Attrition by Gender"),
shareY = TRUE)
Boxplot of customer’s age to analyze the spread -
subplot(
plot_ly(existing.cust, y = ~Customer_Age, type="box",
name = 'Existing',quartilemethod="exclusive"),
plot_ly(attrited.cust, y = ~Customer_Age, type="box",
name = 'Attrited',quartilemethod="exclusive") %>%
layout(title="Age Distribution - Existing vs Attrited Customers"),
shareY = TRUE)
sprintf("The average age for attrited customers is %g",
round(mean(attrited.cust$Customer_Age)))
## [1] "The average age for attrited customers is 47"
sprintf("The average age for existing customers is %g",
round(mean(existing.cust$Customer_Age)))
## [1] "The average age for existing customers is 46"
INFERENCE -
The age distribution of existing and attrited customers is approximately the same with few outliers for existing customers.
The average age for both attrited and existing customers is around 46 which is same as the median.
This may suggest that age may not be a big contributor to attrition.
Histogram to visualize the distribution -
par(mfrow=c(1,2))
hist(attrited.cust$Customer_Age,main="Attrited Customers",xlab="Customer's Age",
xlim=c(20,70),col="red")
hist(existing.cust$Customer_Age,main="Existing Customers",xlab="Customer's Age",
xlim=c(20,80),col="blue")
INFERENCE -
Probability Density Function - Attrited Customers -
mean.a <- mean(attrited.cust$Customer_Age)
sd.a <- sd(attrited.cust$Customer_Age)
data.a <- dnorm(attrited.cust$Customer_Age,mean=mean.a,sd=sd.a)
x.a <- seq(min(attrited.cust$Customer_Age),max(attrited.cust$Customer_Age),5)
plot(attrited.cust$Customer_Age,data.a,pch=19,main="PDF - Attrited Customers",
xlab="Customer's Age",ylab="Probability Density Function")
Cumulative Density Function - Attrited Customers -
data.b <- pnorm(attrited.cust$Customer_Age,mean=mean.a,sd=sd.a)
plot(attrited.cust$Customer_Age,data.b,pch=19,main="CDF - Attrited Customers",
xlab="Customer's Age",ylab="Cumulative Density Function")
We have taken 5000 samples of sizes 10,20,30 & 40 and computed sample-means for the same.
set.seed(2633)
data.mean <- round(mean(raw.data$Customer_Age),2)
data.sd <- round(sd(raw.data$Customer_Age),2)
# function to pick 5000 samples and compute sample-means
sample.func <- function(sample.size,sample.count) {
sample.means <- numeric(sample.count)
i <- 1
for (i in 1:sample.count) {
sample.means[i] <- mean(sample(raw.data$Customer_Age,sample.size,
replace = FALSE))
}
return (sample.means)
}
# sample-size: 10
sample.means10 <- sample.func(10,5000)
mean.sm10 <- mean(sample.means10)
sd.sm10 <- sd(sample.means10)
# sample-size: 20
sample.means20 <- sample.func(20,5000)
mean.sm20 <- mean(sample.means20)
sd.sm20 <- sd(sample.means20)
# sample-size: 30
sample.means30 <- sample.func(30,5000)
mean.sm30 <- mean(sample.means30)
sd.sm30 <- sd(sample.means30)
# sample-size: 40
sample.means40 <- sample.func(40,5000)
mean.sm40 <- mean(sample.means40)
sd.sm40 <- sd(sample.means40)
We then plot the distribution of sample-means for sample-size: 40
# Plot
sd3 <- c(mean.sm40-3*sd.sm40,mean.sm40+3*sd.sm40)
# Plot density of sample means
hist(sample.means40,main="Sample-means - Sample-size: 40",
col="blue",prob=TRUE,
xlab="Sample means - Customer's Age")
abline(v=sd3,col="red") # 3-sd from the mean
INFERENCE -
Analyzing the mean & standard-deviation of the sample-means -
sprintf("Data Mean: %.2f; Data Standard Deviation: %.2f",data.mean,data.sd)
## [1] "Data Mean: 46.33; Data Standard Deviation: 7.93"
sprintf("Sample-size: %g, Mean: %.2f, SD: %.2f",c(10,20,30,40),
c(mean.sm10,mean.sm20,mean.sm30,mean.sm40),
c(sd.sm10,sd.sm20,sd.sm30,sd.sm40))
## [1] "Sample-size: 10, Mean: 46.35, SD: 2.49"
## [2] "Sample-size: 20, Mean: 46.36, SD: 1.74"
## [3] "Sample-size: 30, Mean: 46.35, SD: 1.44"
## [4] "Sample-size: 40, Mean: 46.34, SD: 1.26"
sprintf("Data SD/sqrt(sample.size): %.2f; SD-SampleMeans: %.2f",
(data.sd/sqrt(40)),sd.sm40)
## [1] "Data SD/sqrt(sample.size): 1.25; SD-SampleMeans: 1.26"
INFERENCES -
Mean of data is about equal to the mean of the sample-means for all sample-sizes.
Standard Deviation decreases as the sample-size increases.
Standard Deviation of data/sqrt(sample.size) is equal to Standard Deviation of sample-means.
For each of the sampling methods used, we analyze the following -
How reflective is the sample of the actual data - split of existing vs attrited customers, split of card-categories.
Which card-category has the highest attrition compared to findings from the actual data.
Months-on-book analysis for attrited customers - does the sample reflect the findings from actual data.
Simple Random Sampling
Sample-size: 100
set.seed(2633)
srs <- srswor(n=100,N=nrow(raw.data))
srs.sample <- raw.data[srs!=0,]
paste("Fraction of Attrited vs Existing customers in the data - ")
## [1] "Fraction of Attrited vs Existing customers in the data - "
table(raw.data$Attrition_Flag)/nrow(raw.data)
##
## Attrited Customer Existing Customer
## 0.16 0.84
paste("Fraction of Attrited vs Existing customers picked by SimpleRandomSampling")
## [1] "Fraction of Attrited vs Existing customers picked by SimpleRandomSampling"
table(srs.sample$Attrition_Flag)/nrow(srs.sample)
##
## Attrited Customer Existing Customer
## 0.14 0.86
Split of card-categories in the sample selected -
table(srs.sample$Card_Category)
##
## Blue Silver
## 95 5
INFERENCES -
-We see that the samples picked through SRSWOR approximately reflects the split between attrited/existing customers.
-However, the sample only contains data for Blue and Silver card categories.
attr.srs <- srs.sample[srs.sample$Attrition_Flag=="Attrited Customer",]
table(attr.srs$Card_Category)
##
## Blue
## 14
All 14 samples picked are under the Blue category.Unable to validate the card-category with highest attrition.
hist(attr.srs$Months_on_book,main="Simple Random Sampling",xlab="Months-on-book",labels=TRUE,col="orange")
6/14 = 42% of the attrited customers were with the company - 35-40 months - similar to the findings from the actual data.
Systematic Sampling
Sample-size: 100
set.seed(2633)
grp <- ceiling(nrow(raw.data)/100) # divide into groups
r <- sample(grp,1) # pick first sample
sys.sample <- raw.data[seq(r, by=grp, length=100),]
paste("Fraction of Attrited vs Existing customers picked by Systematic Sampling")
## [1] "Fraction of Attrited vs Existing customers picked by Systematic Sampling"
table(sys.sample$Attrition_Flag)/nrow(sys.sample)
##
## Attrited Customer Existing Customer
## 0.19 0.80
We see that the samples picked through Systematic sampling approximately reflects the split between attrited/existing customers, but SRS yielded a closer split.
Split of card-categories in the sample selected -
table(sys.sample$Card_Category)
##
## Blue Gold Silver
## 91 2 6
attr.sys <- sys.sample[sys.sample$Attrition_Flag=="Attrited Customer",]
table(attr.sys$Card_Category)
##
## Blue Gold
## 18 1
hist(attr.sys$Months_on_book,main="Systematic Sampling",xlab="Months-on-book",labels=TRUE,col="purple")
7/12=58% of attrited customers were with the company for 35-40 months as per samples picked from systematic sampling.
Stratified sampling - using proportional sizes based on the Card-category
set.seed(2633)
sorted.data <- raw.data[order(raw.data$Card_Category),]
size <- round(100*table(raw.data$Card_Category)/nrow(raw.data))
paste(c("Size of samples proportional based on card-category - ",size))
## [1] "Size of samples proportional based on card-category - "
## [2] "93"
## [3] "1"
## [4] "0"
## [5] "6"
# Error Encountered in Strata : Error in data.frame(..., check.names = FALSE) :
# arguments imply differing number of rows: 0, 1
# This is due to one of the groups(platinum) having a size=0
# Overriding the size to replace 0 with 1 for platinum
size <- c(92,1,1,6)
strat <- strata(sorted.data,stratanames="Card_Category",size=size,
method="srswor")
strata.sample <- getdata(sorted.data,strat)
paste("Fraction of Attrited vs Existing customers picked by Stratified Sampling")
## [1] "Fraction of Attrited vs Existing customers picked by Stratified Sampling"
table(strata.sample$Attrition_Flag)/nrow(strata.sample)
##
## Attrited Customer Existing Customer
## 0.19 0.81
INFERENCE -
Split of card-categories in the sample selected -
table(strata.sample$Card_Category)
##
## Blue Gold Platinum Silver
## 92 1 1 6
attr.str <- strata.sample[strata.sample$Attrition_Flag=="Attrited Customer",]
table(attr.str$Card_Category)
##
## Blue Silver
## 18 1
The sample does not contain data for platinum/gold attrited customers.
hist(attr.str$Months_on_book,main="Stratified Sampling",xlab="Months-on-book",labels=TRUE,col="pink")
4/13=46% of attrited customers were with the company for 35-40 months as per samples picked from systematic sampling.
Attrition across the 4 card categories closely reflected the customer split across all customers. We can infer that attrition is across the card-categories and will need to be addressed.
About 41% of the customers who were with the company 35-40 months left the company.
We found that across all the income categories, majority of the customers owned a Blue card and very few customers owned a Silver/Gold card. This may be contributing to the attrition.
I would like to explore various sampling methods like Cluster Sampling and Stratified sampling using Systematic method.
I would also like to analyze details like customer’s education-level, marital-status and credit-limit to gather insights about how they may be contributing to attrition.